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when performing parallel addition on operands.

11. A multiplier which both implements whole word multiplication and implements parallel multiplication of
sub-word multiplicands, the multiplier comprising:

partial product generation means (301-316) for generating partial products;

partial product sum circuitry (320), coupled to the partial product generation means (301-316), for
summing the partial products to produce a result;

selection means (321) for selecting one of whole word multiplication and parallel multiplication of
sub-word multiplicands; and,

partial product selection means, coupled to the partial product generation means (301-316) and to
the selection means (321), for in response to the selection means (321) selecting parallel multiplication
of sub-word multiplicands forcing selected partial products to have a new value, thereby implementing
parallel multiplication of sub-word multiplicands.

12. A multiplier as in claim 11 wherein the partial product selection means, in response to the selection
means (321) selecting parallel multiplication of sub-word multiplicands, forces the selected partial
products to have a value of 0.

13. A multiplier as in claim 12 wherein the partial product generation means (301-316) comprises an array
of logic AND gates (301-316), each logic AND gate in the array of logic AND gates (301-316)
generating a partial product.

14. A multiplier as in claim 13 wherein the partial product selection means comprises third inputs to at least
a portion of the logic AND gates (301-316).

15. A multiplier as in claim 12 wherein when the multiplier is implementing whole word multiplication, the
partial product selection means does not force any partial products to have a value of 0.

16. A multiplier as in claim 11 wherein the multiplier is a Booth-encoded multiplier.

17. A method for performing both multiplication of whole word multiplicands and parallel multiplication of
sub-word multiplicands using a single hardware multiplier, the method comprising the steps of:

(a) generating partial products;

(b) in response to a selection to perform parallel multiplication of sub-word multiplicands, forcing
selected partial products to have a new value; and,

(c) summing the partial products to produce a result, the summing performed using partial product
sum circuitry (320).

18. A method as in claim 17 wherein step (b) includes in response to the selection to perform parallel
multiplication of sub-word multiplicands, forcing selected partial products to have a value of 0.

19. A method as in claim 18 wherein step (a) is performed using an array of logic AND gates (301-316),
each logic AND gate in the array of logic AND gates (301-316) generating a partial product.

20. A method as in claim 19 wherein in step (b) forcing selected partial products to have a value of 0 is
implemented by placing a logic 0 on inputs to a portion of the logic AND gates (301-316).

21. A method as in claim 18 wherein in step (b) when the multiplier is implementing whole word 
multiplication, not forcing any partial products to have a value of 0. 

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(51,81,121) to the third partition circuitry (91,131) when performing parallel operations on operands;
and,

third selection means, coupled between the third partition circuitry (91,131) and the fourth partition
circuitry (101,141), for allowing data to propagate from the third partition circuitry (91,131) to the fourth
partition circuitry (101,141) when performing operations on full word length operands, and for allowing
prevention of data from propagating from the third partition circuitry (91,131) to the fourth partition
circuitry (101,141) when performing parallel operations on operands.

4. A functional unit as in claim 1 wherein the first selection means (50,80,120) includes means for
forwarding a logic 0 to the second partition circuitry (51,81,121) when performing parallel additions on
operands with bit lengths which are smaller than a bit length of the full word length operands, and for
forwarding a logic 1 to the second partition circuitry (51,81,121) when performing parallel subtractions
on operands with bit lengths which are smaller than a bit length of the full word length operands.

5. A functional unit as in claim 1 wherein the functional unit comprises a carry look-ahead adder
(60,61,65,66,69).

6. A method for providing for parallel data processing within a single processor, the method comprising
the steps of:

(a) performing operations on a first set of bits from at least one operand in first partition circuitry
(41,71,111);

(b) performing operations on a second set of bits from the at least one operand in second partition
circuitry (51,81,121);

(c) when performing operations on full word length operands allowing data from the first partition
circuitry (41,71,111) to effect the calculation of results by the second partition circuitry (51,81,121);
and,

(d) when performing parallel operations on operands, preventing data from the first partition circuitry
(41,71,111) to effect the calculation of results by the second partition circuitry (51,81,121).

7. A method as in claim 6 wherein

step (a) includes performing an addition operation on low order bits of the plurality of operands;

step (b) includes an addition operation on high order bits of the plurality of operands;

step (c) includes allowing a carry from the first partition circuitry (41,71,111) to effect the calculation
of results by the second partition circuitry (51,81,121) when performing the addition on full word length
operands; and,

step (d) includes preventing a carry from the first partition circuitry (41,71,111) to effect the
calculation of results by the second partition circuitry (51,81,121) when performing parallel addition on
operands.

8. A method as in claim 6 wherein

step (a) includes performing a carry look-ahead addition operation on low order bits of the plurality
of operands; and,

step (b) includes performing a carry look-ahead addition operation on high order bits of the plurality
of operands.

9. A method as in claim 6 wherein

step (a) includes performing a subtraction operation on low order bits of the plurality of operands;
and,

step (b) includes performing a subtraction operation on high order bits of the plurality of operands.

10. A method as in claim 6 wherein

step (a) includes performing a carry-propagate addition operation on low order bits of the plurality
of operands;

step (b) includes a carry-propagate addition operation on high order bits of the plurality of
operands;

step (c) includes allowing a carry to propagate to the second partition circuitry (51,81,121)
when performing the addition on full word length operands; and,

step (d) includes preventing a carry from propagating to the second partition circuitry (51,81,121)
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Kaufmann, 1990, Appendix, pp. A-39 through A-49. As in the case of the multiplier above, the values of
some partial product terms generated by the Booth-encoded multiplier are changed to take into account the
parallel processing, as will be understood by those skilled in the art.

More specifically, for a Booth-encoded multiplier, the AND gates 301 through 316 shown in Figures 8
and 9 are replaced by multiplexors. For example, a Booth-encoded multiplier using the "overlapping
triplets" method examines three bits of the multiplier (i.e., y multiplicand) each time, instead of one bit each
time, to generate a row of partial products that is one of +x, +2x, -2x, -x or zero, instead of a row of partial
products which is always +x or 0 as in the multiplier shown in Figures 8 and 9. This may be implemented
as a five-to-one multiplexor. The name "overlapping triplets" is due to the fact that this method looks at
three bits of the multiplier (y multiplicand) and retires two bits of the multiplier (y multiplicand) for each row.
The overlapping occurs when, for the next row, the least significant bit of the three multiplier (y
multiplicand) bits used by this next row was the most significant bit of the three multiplier bits used from the
previous row.
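As an illustration only, the overlapping-triplets recoding can be sketched in Python. The function names `booth_radix4_digits` and `booth_multiply`, the two's-complement interpretation of the y multiplicand, and the implicit zero placed below its least significant bit are assumptions of this sketch; they are not taken from the figures.

```python
def booth_radix4_digits(y, bits):
    """Recode a `bits`-wide two's-complement multiplier into one digit per row.

    Each digit is in {-2, -1, 0, +1, +2}, i.e. the row selects -2x, -x, 0, +x or +2x.
    """
    assert bits % 2 == 0, "this sketch assumes an even number of multiplier bits"
    # Append the implicit 0 below the least significant bit (assumption of the sketch).
    y_ext = (y & ((1 << bits) - 1)) << 1
    digits = []
    for row in range(bits // 2):
        # Three bits are examined per row; two new bits are retired per row,
        # so the top bit of one triplet is the bottom bit of the next.
        triplet = (y_ext >> (2 * row)) & 0b111
        low = triplet & 1
        mid = (triplet >> 1) & 1
        high = (triplet >> 2) & 1
        digits.append(low + mid - 2 * high)
    return digits


def booth_multiply(x, y, bits):
    """Sum the recoded rows; row k carries a weight of 4**k."""
    return sum(d * x * (4 ** row) for row, d in enumerate(booth_radix4_digits(y, bits)))
```

With an eight-bit multiplier, only four rows of partial products are produced instead of eight, which is the reduction in rows that Booth encoding provides.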

To implement parallel sub-word multiplication, the bits of the x multiplicand that do not correspond to
the sub-word product whose partial product rows are being formed are set to zero. This can be
implemented with multiplexors like in the unmodified Booth-encoded multiplier, modifying the control
signals to the multiplexors. The sign of the partial product row may also be used as an additional input to
the multiplexors.

The foregoing discussion discloses and describes merely exemplary methods and embodiments of the
present invention. As will be understood by those familiar with the art, the invention may be embodied in
other specific forms without departing from the spirit or essential characteristics thereof. Accordingly, the
disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the
invention, which is set forth in the following claims.

Claims

1. A functional unit within a processing system, the functional unit comprising:

first partition circuitry (41,71,111) which performs operations on a first set of bits from a plurality of
operands;

second partition circuitry (51,81,121) which performs operations on a second set of bits from the
plurality of operands; and,

first selection means (50,80,120), coupled between the first partition circuitry (41,71,111) and the
second partition circuitry (51,81,121), for allowing data to propagate from the first partition circuitry
(41,71,111) to the second partition circuitry (51,81,121) when performing operations on full word length
operands, and for preventing data from propagating from the first partition circuitry (41,71,111) to the
second partition circuitry (51,81,121) when performing parallel operations on operands with bit lengths
which are smaller than a bit length of the full word length operands.

2. A functional unit as in claim 1 wherein

the first partition circuitry (41,71,111) performs an addition operation on low order bits of the
plurality of operands,

the second partition circuitry (51,81,121) performs the addition operation on high order bits of the
plurality of operands, and

the first selection means (50,80,120) is a selector which allows a carry to propagate from the first
partition circuitry (41,71,111) to the second partition circuitry (51,81,121) when performing addition of
full word length operands, and which prevents the carry from propagating from the first partition
circuitry (41,71,111) to the second partition circuitry (51,81,121) when performing parallel additions on
sub-word length operands.

3. A functional unit as in claim 1 additionally comprising:

third partition circuitry (91,131) which performs operations on a third set of bits from a plurality of
operands;

fourth partition circuitry (101,141) which performs operations on a fourth set of bits from the
plurality of operands;

second selection means (90,130), coupled between the second partition circuitry (51,81,121) and
the third partition circuitry (91,131), for allowing data to propagate from the second partition circuitry
(51,81,121) to the third partition circuitry (91,131) when performing operations on full word length
operands, and for allowing prevention of data from propagating from the second partition circuitry
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As can be seen from Table 11 above, in a first parallel multiplication, an eight-bit multiplicand
AAAAAAAA (base 2) is multiplied by an eight-bit multiplicand DDDDDDDD (base 2) to produce a sixteen-bit result
GGGGGGGGGGGGGGGG (base 2). In a second parallel multiplication, a four-bit multiplicand BBBB (base 2) is
multiplied by a four-bit multiplicand EEEE (base 2) to produce an eight-bit result HHHHHHHH (base 2). In a third
parallel multiplication, a four-bit multiplicand CCCC (base 2) is multiplied by a four-bit multiplicand FFFF (base 2)
to produce an eight-bit result IIIIIIII (base 2). As will be understood by those skilled in the art, for every partial
product term in Table 11 with a value of zero, it is necessary to have a three input logic AND gate or its
logic equivalent to allow the partial product to be forced to zero when performing parallel multiplication
operations. However, when a mixture of different sized partitions are done, as in Table 11, in some
implementations different control inputs may be needed to force different partial product terms to zero, as
will be understood by those of ordinary skill in the art.

As may be understood from the above discussion, by selectively forcing partial products of a multiplier
to zero, parallel multiplication of partial words may be fully implemented in a multiplier. The size of the word,
the number of parallel multiplications simultaneously performed and the size of the partial words may be
freely varied in accordance with the teaching of the present invention.

Figure 11 shows an example of instructions that can be executed in accordance with the preferred
embodiment of the present invention. For example, instruction 500 includes a field 501, a subfield 502 of
field 501, a field 503, a field 504 and a field 505. Field 501 sets out the op code. Field 501 sets out, for
example, an add, a shift and add, a subtract, a shift and subtract, a shift left, a shift right, a multiply, or any
number of other operations. Subfield 502 of field 501 indicates whether the operation is to be performed as
parallel operations, and if so, what is the size of the operands. Field 503 indicates a first source register.
Field 504 indicates a second source register. Field 505 indicates a destination register.

As will be understood in the art, instruction 500 illustrates one of many possible ways an instruction can
be organized. For example, instruction 510 shows an alternate embodiment in that the parallel operation
indication is in a separate field. Specifically, instruction 510 includes a field 511, a field 512, a field 513, a
field 514 and a field 515. Field 511 sets out the op code. Field 511 sets out, for example, an add, a shift
and add, a subtract, a shift and subtract, a shift left, a shift right, a multiply, or any number of other
operations. Field 512 indicates whether the operation is to be performed as parallel operations, and if so,
what is the size of the operands. Field 513 indicates a first source register. Field 514 indicates a second
source register. Field 515 indicates a destination register.

As will be understood in the art, the present invention also works for other multipliers where partial
products are generated. For example, the present invention also may be utilized in a Booth-encoded
multiplier. In a Booth-encoded multiplier, fewer rows of partial product terms are generated by considering
more than one bit of the multiplier (y-multiplicand) for each row of the partial product term. See, for
example, John Hennessy & David Patterson, Computer Architecture: A Quantitative Approach, Morgan
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As can be seen from Table 10 above, in a first parallel multiplication, an eight-bit multiplicand
AAAAAAAA (base 2) is multiplied by an eight-bit multiplicand CCCCCCCC (base 2) to produce a sixteen-bit result
EEEEEEEEEEEEEEEE (base 2). In a second parallel multiplication, an eight-bit multiplicand BBBBBBBB (base 2)
is multiplied by an eight-bit multiplicand DDDDDDDD (base 2) to produce a sixteen-bit result
FFFFFFFFFFFFFFFF (base 2). Multiplication of two whole word (sixteen-bit) multiplicands is implemented by
the multiplier by not forcing any of the partial products to zero.

While the above description has shown parallel multiplication of half words, it will be understood by
persons of ordinary skill in the art that both the number of parallel multiplications performed and the size of
the partial word may be varied by selecting the appropriate partial products to force to zero.

For example, the sixteen-bit multiplier implemented as described by Table 4 (and/or Table 10) may be
utilized to perform three simultaneous parallel multiplications by providing circuitry, such as shown in Figure
8 and Figure 9, to force partial products to zero in accordance with the teaching of the present invention.
Thus, modifying the multiplier described by Table 4 in accordance with the teaching of the present
invention allows, for example, the performance of one parallel multiplication using eight-bit multiplicands
and two parallel multiplications using four-bit multiplicands as implemented by Table 11 below:
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As illustrated by Table 7 and Table 8, parallel multiplication of partial words is implemented in a
multiplier by forcing selected partial products in the multiplier to zero. In general, a standard multiplier of
any size may be utilized to perform parallel multiplication by forcing unused partial products to zero. The
partial products are forced to logic 0, for example, by using one or more control inputs and three input logic
AND gates (or their equivalents).
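The principle can be checked with a small behavioural model. The following Python sketch is an illustration only: the function name `multiply_with_partitions`, the `(low_bit, size)` description of each sub-word field and the per-term `keep` signal standing in for the control inputs are assumptions of the sketch, not circuitry described in the figures.

```python
def multiply_with_partitions(x, y, width, partitions):
    """Model an array multiplier whose unused partial products are forced to zero.

    `partitions` lists the sub-word fields as (low_bit, size) pairs. A single
    field covering the whole word gives ordinary whole word multiplication.
    """
    total = 0
    for row in range(width):        # one row of partial products per bit of y
        for col in range(width):    # one partial product per bit of x
            # A partial product is kept only when its x bit and its y bit
            # belong to the same sub-word field.
            keep = any(
                low <= row < low + size and low <= col < low + size
                for low, size in partitions
            )
            x_bit = (x >> col) & 1
            y_bit = (y >> row) & 1
            partial_product = x_bit & y_bit & int(keep)   # three input AND
            # The summing step is the same in both modes.
            total += partial_product << (row + col)
    return total
```

For an eight-bit multiplier split into two four-bit fields, `multiply_with_partitions(0xA3, 0xC5, 8, [(0, 4), (4, 4)])` places the product of the upper sub-words in the upper eight result bits and the product of the lower sub-words in the lower eight result bits, while `[(0, 8)]` reproduces the whole word product.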

For example, as discussed above, an eight-bit multiplier may be implemented as described by Table 3.
This same multiplier may be utilized to perform parallel multiplication of partial word multiplicands by
providing circuitry, such as shown in Figure 8 and Figure 9, to force partial products to zero in accordance
with the teaching of the present invention. No modification is necessary to the partial product sum circuitry.
Thus, modifying the multiplier described by Table 3 in accordance with the teaching of the present
invention allows, for example, the performance of two parallel multiplications using four-bit multiplicands as
implemented by Table 9 below:



As can be seen from Table 9 above, in a first parallel multiplication of partial word multiplicands, a four-bit
multiplicand AAAA (base 2) is multiplied by a four-bit multiplicand CCCC (base 2) to produce an eight-bit result
EEEEEEEE (base 2). In a second parallel multiplication of partial word multiplicands, a four-bit multiplicand
BBBB (base 2) is multiplied by a four-bit multiplicand DDDD (base 2) to produce an eight-bit result FFFFFFFF
(base 2). Multiplication of two whole word (eight-bit) multiplicands is implemented by the multiplier by not forcing
any of the partial products to zero.

Likewise, as discussed above, a sixteen-bit multiplier may be implemented as shown by Table 4. This
same multiplier may be utilized to perform parallel multiplications of partial word multiplicands by providing
circuitry, such as shown in Figure 8 and Figure 9, to force partial products to zero in accordance with the
teaching of the present invention. No modification is necessary to the partial product sum circuitry. Thus,
modifying the multiplier described by Table 4 in accordance with the teaching of the present invention
allows, for example, the performance of two parallel multiplications using eight-bit (partial word)
multiplicands as implemented by Table 10 below:
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A comparison of Table 5 with Table 1 above confirms that when line 321 is set to logic 1, operation of
the multiplier shown in Figure 8 is identical to operation of the multiplier shown in Figure 7. Therefore,
similar to Table 2 above, the simplified notation may be used to describe operation of the multiplier shown
in Figure 8 as in Table 6 below:

Table 6

X X X X

z z z z Y
z z z z Y
z z z z Y
z z z z Y

ZZZZZZZZ



Figure 9 shows the multiplier shown in Figure 8, except however that control line 321 is set at logic 0.
This forces half the partial products to zero, allowing the multiplier to perform parallel multiplication of partial
(two-bit) words. That is, in a first multiplication, a two-bit multiplicand A1A0 (base 2) is multiplied by a two-bit
multiplicand C1C0 (base 2) to produce a four-bit result E3E2E1E0 (base 2). In a second multiplication, a
two-bit multiplicand B1B0 (base 2) is multiplied by a two-bit multiplicand D1D0 (base 2) to produce a four-bit
result F3F2F1F0 (base 2). The partial products not used for the parallel multiplications are forced to logic
zero. The parallel multiplication may be represented in table form as shown in Table 7 below:



Table 7

A1    A0    B1    B0

0     0     D0B1  D0B0   D0
0     0     D1B1  D1B0   D1
C0A1  C0A0  0     0      C0
C1A1  C1A0  0     0      C1

E3 E2 E1 E0 F3 F2 F1 F0



Using the simplified notation first introduced in Table 2, the multiplier shown in Figure 9 may be
represented as in Table 8 below:
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

The multiplier shown in Table 3 multiplies an eight-bit first multiplicand XXXXXXXX (base 2) with an eight-bit
second multiplicand YYYYYYYY (base 2) to produce a sixteen-bit result ZZZZZZZZZZZZZZZZ (base 2).

Similarly, using the simpler notation of Table 2 and Table 3 (but eliminating spaces between bit
positions) a sixteen-bit multiplier may be described as shown in Table 4 below:

lahkJL 

xxxxxxxxxxxxxxxx 
zzzzzzzzzzzzzzzz Y 
zzzzzzzzzzzzzzzz Y 
zzzzzzzzzzzzzzzz Y 
zzzzzzzzzzzzzzzz Y 
zzzzzzzzzzzzzzzz Y 
zzzzzzzzzzzzzzzz Y 
zzzzzzzzzzzzzzzz Y 
zzzzzzzzzzzzzzzz Y 
zzzzzzzzzzzzzzzz Y' 
zzzzzzzzzzzzzzzz Y 
zzzzzzzzzzzzzzzz Y 
zzzzzzzzzzzzzzzz Y 
zzzzzzzzzzzzzzzz Y 
zzzzzzzzzzzzzzzz Y 
zzzzzzzzzzzzzzzz Y 

g2Z22ZZ;SZ222ZZZZ Y 

ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ 



The multiplier shown in Table 4 multiplies a sixteen-bit first multiplicand XXXXXXXXXXXXXXXX (base 2)
with a sixteen-bit second multiplicand YYYYYYYYYYYYYYYY (base 2) to produce a thirty-two-bit result
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ (base 2).

In accordance with preferred embodiments of the present invention, a standard multiplier may be
modified to implement a multiplier which provides parallel multiplication of partial words in addition to
multiplication of whole words. For example, Figure 8 shows a four-bit multiplier in accordance with the
preferred embodiment of the present invention. Logic AND gates 301, 302, 303, 304, 305, 306, 307, 308,
309, 310, 311, 312, 313, 314, 315 and 316 generate partial products for the multiplication. A partial product
sum circuit 320 sums the partial products generated by logic AND gates 301 through 316 to produce the
result.

In the multiplier shown in Figure 8, partial product sum circuit 320 may be implemented exactly the
same as partial product sum circuit 220 shown in Figure 7. The difference between the multiplier shown in
Figure 8 and the multiplier shown in Figure 7 is the addition of a control line 321, which is connected to an
additional input included in each of logic AND gates 303, 304, 307, 308, 309, 310, 313 and 314.

As shown in Figure 8, when control line 321 is set at logic 1, the multiplier performs a whole word
multiplication on a four-bit first multiplicand X3X2X1X0 (base 2) and a four-bit second multiplicand Y3Y2Y1Y0
(base 2) to produce an eight-bit result Z7Z6Z5Z4Z3Z2Z1Z0 (base 2). The two multiplicands, X3X2X1X0 and
Y3Y2Y1Y0, the partial products generated by logic AND gates 301 through 316, and the result produced by
partial product sum circuit 320 may be represented in table form as shown in Table 5 below:
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For example, Figure 7 shows a four-bit multiplier in accordance with the prior art. The multiplier
multiplies a four-bit first multiplicand X3X2X1X0 (base 2) with a four-bit second multiplicand Y3Y2Y1Y0 (base
2) to produce an eight-bit result Z7Z6Z5Z4Z3Z2Z1Z0 (base 2). As is understood by those skilled in the art,
logic AND gates 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215 and 216 may be
used to generate partial products for the multiplication. A partial product sum circuit 220 sums the partial
products generated by logic AND gates 201 through 216 to produce the result.

The two multiplicands, X3X2X1X0 and Y3Y2Y1Y0, the partial products generated by logic AND gates 201
through 216, and the result produced by partial product sum circuit 220 may be placed in a table in such a
way as to summarize operation of the multiplier. For example, such a table is shown as Table 1 below:

Table 1

X3    X2    X1    X0

Y0X3  Y0X2  Y0X1  Y0X0   Y0
Y1X3  Y1X2  Y1X1  Y1X0   Y1
Y2X3  Y2X2  Y2X1  Y2X0   Y2
Y3X3  Y3X2  Y3X1  Y3X0   Y3

Z7 Z6 Z5 Z4 Z3 Z2 Z1 Z0



In the notation used in Table 1 above, the bit position of each bit of both multiplicands and the result is
specifically identified. Additionally, the bits of the multiplicand which are used to form each partial product
are specifically set out. As is understood by those skilled in the art, the information shown in Table 1 above
may be set out using abbreviated or simplified notation, as in Table 2 below:



Table 2

X X X X

z z z z Y
z z z z Y
z z z z Y
z z z z Y

ZZZZZZZZ



In Table 2 above, each bit of the first multiplicand is represented by an "X", each bit of the second
multiplicand is represented by a "Y", each bit of a partial product is represented by a "z", and each bit of
the result is represented by a "Z". Using the simpler notation of Table 2, an eight-bit multiplier may be
described as shown in Table 3 below:

Table 3

XXXXXXXX

zzzzzzzz Y
zzzzzzzz Y
zzzzzzzz Y
zzzzzzzz Y
zzzzzzzz Y
zzzzzzzz Y
zzzzzzzz Y
zzzzzzzz Y

ZZZZZZZZZZZZZZZZ
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Equation 5

Gs[i] = (M[i] * F) + (!M[i] * (A[i] * B[i]))
      = (M[i] * F) + (!M[i] * G[i])

Ps[i] = Pm[i]

Now if M[i] is 1, the value of Gs[i] is determined by F. If M[i] is 0, the value of Gs[i] is determined by A[i]
and B[i] as it was previously. The propagate does not have to be forced by the F signal.
The equation of the carry out is given by Equation 6 below:

Equation 6 C[i] = Gs[i] + Ps[i] * C[i-1]
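The effect of the modified generate and propagate signals on the carry of Equation 6 can be traced bit by bit in a short Python sketch. Several points are assumptions of the sketch rather than statements of this description: the function name `add_with_partition_mask`, the definition Pm[i] = P[i] * !M[i] (a propagate that is blocked wherever the mask bit is set), and the use of F as the carry into bit 0.

```python
def add_with_partition_mask(a, b, width, mask, f):
    """Evaluate the sum and carry equations one bit at a time.

    Bit i of `mask` plays the role of M[i]: when it is 1, the carry out of
    bit i is replaced by F, so a new partition starts at bit i+1.
    """
    carry = f                     # assumed carry into bit 0
    total = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        m_i = (mask >> i) & 1
        g = a_i & b_i             # generate
        p = a_i | b_i             # propagate
        pm = p & (1 - m_i)        # assumed definition of Pm[i]
        gs = (m_i & f) | ((1 - m_i) & g)
        ps = pm
        total |= (a_i ^ b_i ^ carry) << i   # sum bit uses the incoming carry
        carry = gs | (ps & carry)           # Equation 6
    return total
```

With `mask = 0b00001000` an eight-bit adder behaves as two four-bit adders. For a parallel subtraction the caller would supply the bitwise inverse of the second operand and set F to 1, so that each partition receives its own carry-in of 1.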

As will be understood by persons of skill in the art, principles of the present invention are not confined
to arithmetic operations within computer system ALUs. For example, partitioning as shown in the ALU may
also be extended to other entities within the computer system which operate on data. For example, Figure 6
shows the present invention embodied in pre-shifter 27. The same embodiment of the present invention
may also be used to implement shifter 29. Partitioning of pre-shifter 27 and shifter 29 allows, for example,
for the implementation of parallel shift-and-add operations and parallel shift operations.

Pre-shifter 27 is shown to include a shift register one-bit slice 160, a shift register one-bit slice 161, a
shift register one-bit slice 165, a shift register one-bit slice 166 and a shift register one-bit slice 169.
When data is shifted to the left, a datum on input 171, typically a logic 0 value, is used as input to shift
register one-bit slice 160. When data is shifted to the right, a selector 175 in response to a control input 182
selects either a datum on input 181 (a logic 0 value or a logic 1 value) or selects the value currently stored
by shift register one-bit slice 169 to be input to shift register one-bit slice 169.

Wherever the shifter is to be partitioned, additional selectors are added to the shifter. For example,
Figure 6 shows the shifter partitioned between shift register one-bit slice 165 and shift register one-bit slice
166. There a selector 174 and a selector 173 have been added. For shift operations on partitioned
operands, when data is shifted to the left, selector 173, in response to a control input 185, selects a datum
on input 172, typically a logic 0 value, to be used as input to shift register one-bit slice 166. For shift
operations on full word operands, when data is shifted to the left, selector 173 selects output from shift
register one-bit slice 165 to be used as input to shift register one-bit slice 166.

For shift operations on partitioned operands, when data is shifted to the right, selector 174 in response
to a control input 184 selects either a datum on input 183 (a logic 0 value or a logic 1 value) or selects the
value currently stored by shift register one-bit slice 166 to be input to shift register one-bit slice 165. For
shift operations on full word operands, when data is shifted to the right, selector 174 selects output from
shift register one-bit slice 166 to be used as input to shift register one-bit slice 165.
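The left-shift case can be modelled in a few lines of Python. This is a sketch under stated assumptions: the function name `shift_left_one`, the parameter `boundary` (the bit position taken to correspond to slice 166) and the modelling of the whole register as one integer are illustrative choices, and only the left shift is covered.

```python
def shift_left_one(value, width, boundary, partitioned, in_171=0, in_172=0):
    """Shift a `width`-bit register left by one position.

    `in_171` is the datum fed to the lowest slice. When `partitioned` is true,
    the slice at `boundary` receives `in_172` instead of the bit coming from
    the slice just below it, as selector 173 does.
    """
    word_mask = (1 << width) - 1
    shifted = ((value << 1) & word_mask) | in_171
    if partitioned:
        # Replace the bit that crossed the partition with the datum on input 172.
        shifted = (shifted & ~(1 << boundary)) | (in_172 << boundary)
    return shifted & word_mask
```

For example, with `width=8` and `boundary=4`, the value `0b10011001` shifts to `0b00110010` as a full word and to `0b00100010` as two four-bit partitions, since the bit leaving the lower partition is not allowed to enter the upper one.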

Figure 6 shows a shifter with only two partitions. As will be understood from the foregoing discussion of
partitions in an ALU, the shifter can be partitioned in a variety of ways. For example, a 64-bit shifter may be
partitioned into two, four, eight, sixteen, thirty-two or sixty-four bit equal size partitions. Additionally, it is not
a requirement of the present invention that partitions each operate on an equal number of bits.

While the above embodiment describes the pre-shifter 27 and shifter 29 implemented as a shift register
consisting of a series of one-bit slices, alternative preferred embodiments are pre-shifters and shifters
implemented with multiplexors. Typically, pre-shifter 27 is implemented by one level of multiplexors, since
it can usually shift by at most a small number of bits, for example, 0, 1, 2, 3 or 4 bits. Shifter 29 is typically
implemented by three levels of multiplexors, where each level of multiplexor is a four-to-one multiplexor.

For example, in a 64-bit shifter 29, the first level of multiplexors will shift either 0, 16, 32 or 48 bits. The
second level of multiplexors can shift either 0, 4, 8 or 12 bits. The third level of multiplexors can shift 0, 1, 2
or 3 bits. This gives a shift of any number of bits from 0 to 63. In such a shifter built up of 3 stages of
multiplexors, one-bit slices can still be identified. However, the blocking of the shifts between any two bits
may need to be done in one or more of the three multiplexor stages, as will be understood by those of
ordinary skill in the art.
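The decomposition of a shift amount across the three multiplexor levels can be illustrated with a short Python sketch. The function name `shift_left_64` and the choice of a left shift on an unpartitioned word are assumptions made for the illustration; the blocking needed for partitioned operands is not modelled.

```python
def shift_left_64(value, amount):
    """Shift a 64-bit word left using three four-to-one selection stages."""
    assert 0 <= amount < 64
    word_mask = (1 << 64) - 1
    level1 = (amount >> 4) & 0b11   # selects a shift of 0, 16, 32 or 48
    level2 = (amount >> 2) & 0b11   # selects a shift of 0, 4, 8 or 12
    level3 = amount & 0b11          # selects a shift of 0, 1, 2 or 3
    value = (value << (16 * level1)) & word_mask
    value = (value << (4 * level2)) & word_mask
    value = (value << level3) & word_mask
    return value
```

A shift by 37, for instance, is carried out as 32 + 4 + 1, one selection per level.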

Principles of the present invention may also be extended to other elements in a computer system. For 
example, a multiplier may be implemented in accordance with a preferred embodiment of the present 
invention to allow for partial word parallel multiplications in addition to whole word multiplications. 
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bit Z1. A full adder 465 receives a single bit Xi-1 of the first operand, a single bit Yi-1 of the second
operand and a carry bit Ci-2. Full adder 465 produces a sum bit Zi-1. A full adder 466 receives a single bit
Xi of the first operand, a single bit Yi of the second operand and a carry bit Ci-1. Full adder 466 produces a
sum bit Zi. A full adder 469 receives a single bit Xj-1 of the first operand, a single bit Yj-1 of the second
operand and a carry bit Cj-2. Full adder 469 produces a sum bit Zj-1.

In the embodiment of the adder shown in Figure 10, "j" is the size of the data path and the bit length of
full word operations. Also, "i" is equal to "j" divided by 2. For example, "j" is equal to 32 and "i" is equal to
16. Alternately, when j is equal to 32, i may be equal to any integer less than 32.

When performing operations using "j"-bit full word operands, an enable bit 452 is equal to logic one
and allows all carries to propagate. When performing two parallel operations using "i"-bit sub-word
operands partitioned between bits i and i-1, enable bit 452 is equal to logic zero and prevents the carry
propagating across the partition boundary. Instead the value on line 451 is used as the value forwarded to
full adder 466. When an "add" is being performed, a logic 0 is placed on input line 451. When a "subtract"
is being performed, a logic 1 is placed on input line 451.

Operation of carry look-ahead adders is well understood in the art. For example, suppose A[i] is one
bit of an input, B[i] is one bit of the other input and S[i] is one bit of the sum from the adder. Then, the sum
from one bit of the adder is given by Equation 1 below:

Equation 1    S[i] = A[i] XOR B[i] XOR C[i-1]

In Equation 1, C[i-1] is the carry out of the previous bits of the carry look-ahead adder. The carry
look-ahead adder works by generating these carry bits quickly.

Let G[i] be a signal which signifies that a carry is to be generated by this bit and P[i] be a signal that a
carry may propagate from the previous bits to the output of this bit. These are determined in accordance
with Equation 2 below:

Equation 2    G[i] = A[i] AND B[i];    P[i] = A[i] OR B[i]

Therefore, for four bits within a carry look-ahead adder, the carry bits may be generated as in Equation
3 below:

Equation 3

C[i]   = G[i] + P[i] * (G[i-1] + P[i-1] * (G[i-2] + P[i-2] * (G[i-3] + P[i-3] * C[i-4])))
C[i-1] = G[i-1] + P[i-1] * (G[i-2] + P[i-2] * (G[i-3] + P[i-3] * C[i-4]))
C[i-2] = G[i-2] + P[i-2] * (G[i-3] + P[i-3] * C[i-4])
C[i-3] = G[i-3] + P[i-3] * C[i-4]

In Equation 3 above, "*" is equivalent to a logic AND operation and "+" is equivalent to a logic OR
operation.
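
The generate, propagate and carry relations above may be checked with a short software model. The
sketch below is illustrative only; the function names `lookahead_carries` and `add4` and the list-based bit
ordering are assumptions made for the example. Each carry is written out in full from the group carry-in,
as in the four-bit look-ahead expressions, rather than being taken from the previous carry.

```python
def lookahead_carries(a_bits, b_bits, carry_in):
    """Return [C[i-3], C[i-2], C[i-1], C[i]] for one four-bit group.

    a_bits and b_bits list the operand bits from position i-3 up to i;
    carry_in plays the role of C[i-4].
    """
    g = [a & b for a, b in zip(a_bits, b_bits)]  # generate: A AND B
    p = [a | b for a, b in zip(a_bits, b_bits)]  # propagate: A OR B
    c0 = g[0] | (p[0] & carry_in)
    c1 = g[1] | (p[1] & (g[0] | (p[0] & carry_in)))
    c2 = g[2] | (p[2] & (g[1] | (p[1] & (g[0] | (p[0] & carry_in)))))
    c3 = g[3] | (p[3] & (g[2] | (p[2] & (g[1] | (p[1] & (g[0] | (p[0] & carry_in)))))))
    return [c0, c1, c2, c3]


def add4(a, b, carry_in=0):
    """Add two 4-bit values; each sum bit is A XOR B XOR incoming carry."""
    a_bits = [(a >> k) & 1 for k in range(4)]
    b_bits = [(b >> k) & 1 for k in range(4)]
    carries = lookahead_carries(a_bits, b_bits, carry_in)
    incoming = [carry_in] + carries[:3]  # carry entering each bit position
    total = 0
    for k in range(4):
        total |= (a_bits[k] ^ b_bits[k] ^ incoming[k]) << k
    return total, carries[3]
```

For every pair of 4-bit inputs, the returned sum and carry-out together equal the ordinary integer sum of
the operands and the carry-in.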

When implementing a preferred embodiment of the present invention, a carry is stopped at a particular
bit if the Generate G[i] and Propagate P[i] are forced to be false. For instance, in Equation 3 above, if G[i-3]
and P[i-3] are false, C[i-3] will be false and C[i-4] can never affect the value of C[i-2], C[i-1], and C[i].
Likewise, if G[i-2] and P[i-2] are false, C[i-2] will be false and G[i-3] and P[i-3] and C[i-4] can never affect
the value of C[i-1] and C[i].

If we let M[i] be a mask bit that breaks the carry chain between bit [i] and bit [i+1] when M[i] is 1,
then a new Equation 4 can be generated as follows:

Equation 4    Gm[i] = !M[i] * (A[i] * B[i]);    Pm[i] = !M[i] * (A[i] + B[i])

Now if M[i] is 1, a carry will not be allowed to be generated from bit [i] or to propagate through bit [i].
For subtraction by creating the one's complement of one of the operands and adding it to the other
operand with a carry in (two's complement arithmetic), a carry must be forced to be generated in a bit when
M[i] is 1.
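
The effect of the mask bit on addition may be sketched as follows. In this illustration the masked
generate is taken as NOT M[i] AND (A[i] AND B[i]) and the masked propagate as NOT M[i] AND (A[i] OR
B[i]). The function name `masked_add` and the bit-serial evaluation of the carry (each carry computed
from the previous one rather than by look-ahead logic) are assumptions made to keep the example short.
Only addition is modelled; the forced carry required for subtraction is not.

```python
def masked_add(a, b, mask, width=32):
    """Add a and b, breaking the carry chain after every bit set in `mask`."""
    carry = 0
    total = 0
    for i in range(width):
        a_i, b_i, m_i = (a >> i) & 1, (b >> i) & 1, (mask >> i) & 1
        total |= (a_i ^ b_i ^ carry) << i   # sum bit uses the incoming carry
        gm = (1 - m_i) & (a_i & b_i)        # masked generate
        pm = (1 - m_i) & (a_i | b_i)        # masked propagate
        carry = gm | (pm & carry)           # carry passed on to bit i+1
    return total
```

For example, with `mask = 1 << 15` a 32-bit addition is performed as two independent 16-bit additions,
because no carry leaves bit 15.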

Let F be a signal that when true will force a carry to be generated in a bit when M[i] is 1. The equations
for Gm[i] and Pm[i] become as set out in Equation 5 below:






EP 0 654 733 A1 



For example, in a computer which has a sixty-four bit wide data path, each full-word operand is 64 bits.
Therefore, when performing operations using 64-bit full word operands, selector 80 allows information to
propagate from first partition 71 through selector 80 to second partition 81, selector 90 allows information to
propagate from second partition 81 through selector 90 to third partition 91, and selector 100 allows
information to propagate from third partition 91 through selector 100 to fourth partition 101. When
performing two parallel operations using 32-bit half word operands, selector 80 allows information to
propagate from first partition 71 through selector 80 to second partition 81, selector 90 prevents information
from propagating from second partition 81 through selector 90 to third partition 91, and selector 100 allows
information to propagate from third partition 91 through selector 100 to fourth partition 101. When
performing four parallel operations using 16-bit quarter word operands, selector 80 prevents information
from propagating from first partition 71 through selector 80 to second partition 81, selector 90 prevents
information from propagating from second partition 81 through selector 90 to third partition 91, and selector
100 prevents information from propagating from third partition 91 through selector 100 to fourth partition
101.
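
The selector settings for the three operand sizes may be summarised in a small software model. The
sketch below is illustrative only: the names `selector_settings` and `partitioned_add64`, the table layout
and the `forced` parameter (standing for the value placed on the selector's alternate input line) are
assumptions made for the example.

```python
PARTITION_BITS = 16  # four partitions of a 64-bit data path


def selector_settings(operand_bits):
    """Map operand size to (selector 80, selector 90, selector 100); True = propagate."""
    table = {
        64: (True, True, True),      # full word: every carry propagates
        32: (True, False, True),     # half words: only selector 90 blocks
        16: (False, False, False),   # quarter words: every selector blocks
    }
    return table[operand_bits]


def partitioned_add64(x, y, operand_bits, forced=0):
    """Add x and y as 64-, 32- or 16-bit lanes, one 16-bit partition at a time."""
    passes = (None,) + selector_settings(operand_bits)
    lane_mask = (1 << PARTITION_BITS) - 1
    carry, result = 0, 0
    for index in range(4):
        if index > 0 and not passes[index]:
            carry = forced  # selector intercepts the carry and forwards the forced value
        shift = index * PARTITION_BITS
        total = ((x >> shift) & lane_mask) + ((y >> shift) & lane_mask) + carry
        result |= (total & lane_mask) << shift
        carry = total >> PARTITION_BITS
    return result
```

Adding 1 to an all-ones word shows the three behaviours: the carry runs through all 64 bits in full-word
mode, stops at bit 32 in half-word mode, and stops at bit 16 in quarter-word mode.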

Figure 5 shows another alternate simplified block diagram of ALU 26 in accordance with another
alternate preferred embodiment of the present invention. In Figure 5, ALU 26 is divided into partitions which
are each one bit wide. A first partition 111 performs operations on a low order bit 112 of a first operand and
on a low order bit 113 of a second operand to produce a low order result bit 114. A second partition 121
performs operations on a bit 122 of the first operand and a bit 123 of the second operand to produce a
result bit 124. A partition 131 performs operations on a bit 132 of the first operand and a bit 133 of the
second operand to produce a result bit 134. A partition 141 performs operations on a bit 142 of the first
operand and a bit 143 of the second operand to produce a result bit 144. A partition 151 performs
operations on a high order bit 152 of the first operand and a high order bit 153 of the second operand to
produce a high order result bit 154.

In response to a control input 119, a selector 120 is used to allow information on data path 115 to
propagate from first partition 111 to second partition 121 or to intercept information on data path 115 before
it is propagated from first partition 111 to second partition 121. When data is intercepted, the value on a line
128 is forwarded to partition 121. When an "add" is being performed, a logic 0 is placed on line 128. When
a "subtract" is being performed, a logic 1 is placed on line 128.

In response to a control input 129, a selector 130 is used to allow information on a data path from an
immediately prior partition (not shown) to propagate from the immediately prior partition to partition 131 or
to intercept information on the data path from the immediately prior partition before it is propagated to
partition 131. When data is intercepted, the value on a line 138 is forwarded to partition 131. When an "add"
is being performed, a logic 0 is placed on line 138. When a "subtract" is being performed, a logic 1 is
placed on line 138.

In response to a control input 139, a selector 140 is used to allow information on data path 135 to
propagate from partition 131 to partition 141 or to intercept information on data path 135 before it is
propagated from partition 131 to partition 141. When data is intercepted, the value on a line 148 is forwarded
to partition 141. When an "add" is being performed, a logic 0 is placed on line 148. When a "subtract" is
being performed, a logic 1 is placed on line 148.

In response to a control input 149, a selector 150 is used to allow information on a data path from an
immediately prior partition (not shown) to propagate from the immediately prior partition to partition 151 or
to intercept information on the data path from the immediately prior partition before it is propagated to
partition 151. When data is intercepted, the value on a line 158 is forwarded to partition 151. When an "add"
is being performed, a logic 0 is placed on line 158. When a "subtract" is being performed, a logic 1 is
placed on line 158.

The control inputs to the selectors may be used to allow parallel processing of operands of varying
length. For example, in a processing system with a sixty-four bit wide data path, the control inputs could be
selected so that parallel processing of two sixteen-bit and four eight-bit arithmetic operations are all
performed simultaneously. Additionally, any combination of bit lengths which adds up to no more than the
word size could be used. For example, parallel processing of seventeen-bit, three-bit, sixteen-bit, twelve-bit,
five-bit, and eleven-bit arithmetic operations can also be performed simultaneously.
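
A software model of one-bit-wide partitions shows how arbitrary operand lengths may be packed into
one word. The sketch below is illustrative only; the names `split_controls` and `bit_sliced_add`, and the
representation of the control inputs as a set of bit positions, are assumptions made for the example.

```python
def split_controls(widths, word_size=64):
    """Return the bit positions whose incoming carry is intercepted."""
    assert sum(widths) <= word_size
    blocked, position = set(), 0
    for width in widths:
        position += width
        if position < word_size:
            blocked.add(position)  # the selector feeding this bit blocks the carry
    return blocked


def bit_sliced_add(x, y, widths, word_size=64, forced=0):
    """Add x and y as independent lanes of the given widths, lowest lane first."""
    blocked = split_controls(widths, word_size)
    carry, result = 0, 0
    for bit in range(word_size):
        if bit in blocked:
            carry = forced  # value supplied on the selector's alternate input
        a, b = (x >> bit) & 1, (y >> bit) & 1
        result |= (a ^ b ^ carry) << bit
        carry = (a & b) | (carry & (a | b))
    return result
```

Calling `bit_sliced_add(x, y, [17, 3, 16, 12, 5, 11])` performs the six additions of the example above in
a single pass over the 64-bit word.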

The principles discussed above also apply to a carry look-ahead adder. For example, Figure 10 shows
implementation of a two's complement adder with carry look-ahead within ALU 26 in accordance with
another preferred embodiment of the present invention. A carry look-ahead circuit 470 produces carries for
the adder. A half adder 460 receives a single bit X_0 of a first operand and a single bit Y_0 of a second
operand. Half adder 460 produces a sum bit Z_0. A full adder 461 receives a single bit X_1 of the first
operand, a single bit Y_1 of the second operand and a carry bit C_0. Full adder 461 produces a sum




Figure 3 shows implementation of a two's complement adder with carry propagate addition within ALU
26 in accordance with a preferred embodiment of the present invention. Alternately, ALU 26 includes a
two's complement adder with carry look-ahead. A half adder 60 receives a single bit X_0 of a first operand
and a single bit Y_0 of a second operand. Half adder 60 produces a sum bit Z_0 and a carry bit C_0. A full
adder 61 receives a single bit X_1 of the first operand, a single bit Y_1 of the second operand and carry bit
C_0. Full adder 61 produces a sum bit Z_1 and a carry bit C_1. A full adder 65 receives a single bit X_{i-1} of the
first operand, a single bit Y_{i-1} of the second operand and a carry bit from a previous adder (i.e., C_{i-2}, not
shown). Full adder 65 produces a sum bit Z_{i-1} and a carry bit C_{i-1}. A full adder 66 receives a single bit X_i
of the first operand and a single bit Y_i of the second operand. Depending on a value of enable bit 49, full
adder 66 also receives, through selector 50 (or equivalent logic circuitry as will be understood by persons of
ordinary skill in the art), carry bit C_{i-1}. Full adder 66 produces a sum bit Z_i and a carry bit C_i. A full adder
69 receives a single bit X_{j-1} of the first operand, a single bit Y_{j-1} of the second operand and a carry bit
from a previous adder (not shown). Full adder 69 produces a sum bit Z_{j-1} and a carry bit C_{j-1}.

In the embodiment of the adder shown in Figure 3, "j" is the size of the data path and the bit length of
full word operations. Also, "i" is equal to "j" divided by 2. For example, "j" is equal to 32 and "i" is equal to
16.

Selector 50 is also shown in Figure 3. When performing operations using "j"-bit full word operands,
enable bit 49 is equal to logic one and allows a carry to propagate through selector 50 to full adder 66.
When performing two parallel operations using "i"-bit half word operands, enable bit 49 is equal to logic
zero and prevents the carry from propagating through selector 50 to full adder 66. Instead the value on line
59 is forwarded to full adder 66. When an "add" is being performed, a logic 0 is placed on input line 59.
When a "subtract" is being performed, a logic 1 is placed on input line 59.
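
The behaviour of the selector at the half-word boundary may be sketched as a bit-serial software
model. The sketch is illustrative only: the function name `ripple_add` and its parameters are assumptions,
and the `carry_in` argument for bit 0 is added for the subtraction example even though the adder
described above uses a half adder, with no carry input, at bit 0.

```python
def ripple_add(x, y, width=32, split=16, enable=1, forced=0, carry_in=0):
    """Carry-propagate addition with a selector in front of bit `split`.

    enable=1 lets the carry from bit split-1 reach bit split (full word).
    enable=0 replaces that carry with `forced` (0 for add, 1 for subtract).
    """
    carry = carry_in
    result = 0
    for bit in range(width):
        if bit == split and not enable:
            carry = forced  # selector forwards the forced value instead of the carry
        a = (x >> bit) & 1
        b = (y >> bit) & 1
        result |= (a ^ b ^ carry) << bit      # sum bit
        carry = (a & b) | (carry & (a | b))   # carry to the next full adder
    return result
```

For instance, `ripple_add(0x0001FFFF, 1)` returns `0x00020000`, while the same call with `enable=0`
returns `0x00010000` because the carry out of the low half is discarded. Two parallel subtractions can be
modelled by complementing the second operand and supplying a 1 both as `carry_in` and as `forced`.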

While Figures 2 and 3 discuss implementations of ALU 26 with two partitions, an ALU designed in
accordance with other preferred embodiments of the present invention may variously partition an ALU. For
example, Figure 4 shows an alternate simplified block diagram of ALU 26 in accordance with an alternate
preferred embodiment of the present invention. In Figure 4, ALU 26 is divided into four partitions. A first
partition 71 performs operations on low order bits 72 of a first operand and low order bits 73 of a second
operand to produce low order bit results 74. A second partition 81 performs operations on bits 82 of the first
operand and bits 83 of the second operand to produce result bits 84. A third partition 91 performs
operations on bits 92 of the first operand and bits 93 of the second operand to produce result bits 94. A
fourth partition 101 performs operations on high order bits 102 of the first operand and high order bits 103
of the second operand to produce high order bit results 104.

In response to a control input 79, a selector 80 is used to allow information on data path 75 to
propagate from first partition 71 to second partition 81 or to intercept information on data path 75 before it is
propagated from first partition 71 to second partition 81. Particularly, for arithmetic operations performed on
full-word operands or half-word operands, information is allowed to propagate from first partition 71 through
selector 80 to second partition 81. For the performance of parallel arithmetic operations on quarter-word
operands, selector 80 prevents information from propagating from first partition 71 to second partition 81.
Instead the value on a line 88 is forwarded to partition 81. When an "add" is being performed, a logic 0 is
placed on line 88. When a "subtract" is being performed, a logic 1 is placed on line 88. Generally, in logic
operations there is no propagation of information between partitions.

In response to a control input 89, a selector 90 is used to allow information on data path 85 to
propagate from second partition 81 to third partition 91 or to intercept information on data path 85 before it
is propagated from second partition 81 to third partition 91. Particularly, for arithmetic operations performed
on full-word operands, information is allowed to propagate from second partition 81 through selector 90 to
third partition 91. For the performance of parallel arithmetic operations on quarter-word operands or
half-word operands, selector 90 prevents information from propagating from second partition 81 to third partition
91. Instead the value on a line 98 is forwarded to partition 91. When an "add" is being performed, a logic 0
is placed on line 98. When a "subtract" is being performed, a logic 1 is placed on line 98.

In response to a control input 99, a selector 100 is used to allow information on data path 95 to
propagate from third partition 91 to fourth partition 101 or to intercept information on data path 95 before it
is propagated from third partition 91 to fourth partition 101. Particularly, for arithmetic operations performed
on full-word operands and half-word operands, information is allowed to propagate from third partition 91
through selector 100 to fourth partition 101. For the performance of parallel arithmetic operations on
quarter-word operands, selector 100 prevents information from propagating from third partition 91 to fourth partition
101. Instead the value on a line 108 is forwarded to partition 101. When an "add" is being performed, a
logic 0 is placed on line 108. When a "subtract" is being performed, a logic 1 is placed on line 108.
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Figure 2 shows a simplified block diagram of an arithmetic logic unit (ALU) shown in Figure 1 in
accordance with a preferred embodiment of the present invention.

Figure 3 shows an implementation of a two's complement adder within the ALU shown in Figure 2 in
accordance with a preferred embodiment of the present invention.

Figure 4 shows an alternate simplified block diagram of the arithmetic logic unit (ALU) shown in Figure
1 in accordance with an alternate preferred embodiment of the present invention.

Figure 5 shows another alternate simplified block diagram of the arithmetic logic unit (ALU) shown in
Figure 1 in accordance with another alternate preferred embodiment of the present invention.

Figure 6 shows an implementation of a shifter shown in Figure 1 in accordance with a preferred
embodiment of the present invention.

Figure 7 shows a multiplier in accordance with the prior art.

Figure 8 and Figure 9 show a multiplier implemented in accordance with preferred embodiments of the
present invention.

Figure 10 shows an implementation of a carry look-ahead adder within the ALU shown in Figure 1 in
accordance with an alternate preferred embodiment of the present invention.

Figure 11 shows an example of an instruction layout in accordance with an alternate preferred
embodiment of the present invention.

Description of the Preferred Embodiments 


Figure 1 shows a simplified block diagram of an operation execution data path within a processor in
accordance with preferred embodiments of the present invention. Operands for upcoming operations and
results from accomplished operations are stored within general registers 25. When operations are
performed, a first operand stored in a first register within general registers 25 is placed on a first source bus
21. If the operation requires another operand, a second operand stored in a second register within general
registers 25 is placed on a second source bus 22.

After performance of the operation, the result is placed on a result bus 23 and loaded into a register
within general registers 25. The operation is performed by arithmetic logic unit (ALU) 26 or by a shifter 29.
A pre-shifter 27 and complement circuitry 28 may each be used to modify operands before they are
received by ALU 26. For general background about the architecture of single processor systems
constructed similarly to the present invention see, for example, Ruby B. Lee, Precision Architecture, IEEE
Computer, Volume 22, No. 1, January 1989, pp. 78-91.

In accordance with the preferred embodiments of the present invention, the ALU may be partitioned to
allow parallel data processing. For example, Figure 2 shows ALU 26 divided into two partitions. A first
partition 41 performs operations on low order bits 42 of a first operand and low order bits 43 of a second
operand to produce low order bit results 44. A second partition 51 performs operations on high order bits 52
of the first operand and high order bits 53 of the second operand to produce high order bit results 54.

In response to a control input 49, a selector 50 is used to allow information on data path 45 to
propagate from first partition 41 to second partition 51 or to intercept information on data path 45 before it is
propagated from first partition 41 to second partition 51. Particularly, for arithmetic operations performed on
full-word operands, information is allowed to propagate from first partition 41 through selector 50 to second
partition 51. For the performance of parallel arithmetic operations on half-word operands, selector 50
prevents information from propagating from first partition 41 to second partition 51. Generally, in logic
operations, there is no propagation of information from first partition 41 to second partition 51.

For example, in a computer which has a thirty-two bit wide data path, each full-word operand is 32 bits.
Therefore, when performing operations using 32-bit full word operands, selector 50 allows information to
propagate from first partition 41 through selector 50 to second partition 51. When performing two parallel
operations using 16-bit half word operands, selector 50 prevents information from propagating from first
partition 41 through selector 50 to second partition 51. Instead the value on a line 59 is forwarded to
partition 51. When an "add" is being performed, a logic 0 is placed on input line 59. When a "subtract" is
being performed, a logic 1 is placed on input line 59.

In the preferred embodiment of the present invention, a common arithmetic operation performed by
ALU 26, shown in Figure 1, is two's complement addition. As is understood by those skilled in the art, the
use of two's complement circuitry 28 to perform a two's complement on an operand before performing a
two's complement addition operation in the ALU implements a two's complement subtraction. Also, the use
of pre-shifter 27 to pre-shift an operand before performing a two's complement addition operation in the
ALU implements a shift and add operation.
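
The way in which a single adder yields subtraction and shift-and-add may be sketched in a few lines.
The example is illustrative only: the 32-bit word size, the function names and the choice of which operand
is complemented or pre-shifted are assumptions made for the example.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1


def twos_complement(value):
    """Model of the complement step: invert the bits and add one."""
    return (~value + 1) & WORD_MASK


def add(x, y):
    """Two's complement addition as performed by the ALU (result wraps)."""
    return (x + y) & WORD_MASK


def subtract(x, y):
    """x - y, obtained by complementing y before the addition."""
    return add(x, twos_complement(y))


def shift_and_add(x, y, shift):
    """(x << shift) + y, obtained by pre-shifting x before the addition."""
    return add((x << shift) & WORD_MASK, y)
```

A negative difference appears in two's complement form, so `subtract(3, 10)` returns `0xFFFFFFF9`,
which is -7 as a 32-bit value.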
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Background

The present invention concerns parallel data processing in a single processor system. 

In general, single processor systems sequentially perform operations on two operands. For example, in
a 32-bit computer, each integer operand is 32 bits. In a 64-bit computer, each integer operand is 64 bits.
Thus an integer "add" instruction, in a 64-bit computer, adds two 64-bit integer operands to produce a
64-bit integer result. In most pipelined 64-bit processors, a 64-bit add instruction takes one cycle of execution
time.

In many instances the pertinent range of operands is 16 bits or less. In current 32-bit and 64-bit
computers, however, it still takes a full instruction to perform an operation on a pair of 16-bit operands. Thus
the number of execution cycles required to perform an operation on two 16-bit operands is the same as the
number of execution cycles required to perform the operation on two 32-bit operands in a 32-bit computer
or two 64-bit operands in a 64-bit computer.

In the prior art, parallel data processing required replicating of functional units, each functional unit able
to handle full word length data. See for example, Michael Flynn, Very High-Speed Computing Systems,
Proceedings of IEEE, Vol. 54, Number 12, December 1966, pp. 1901-1909. However, such
implementations of parallel processing are significantly costly both in terms of hardware required and complexity in
design.

Summary of the Invention

In accordance with the preferred embodiment of the present invention, a system is presented which
allows parallel data processing within a single processor. In order to allow for parallel processing of data, an
arithmetic logic unit or other operation executing entity within the processing system such as a shifter is
partitioned. Within each partition operations are performed. When the operation is to be performed on full
word length operands, there is no parallel processing. Thus data is allowed to freely propagate across
boundaries between the partitions. When performing the operation in parallel using a plurality of less than
one full word length operands, data is prevented from being propagated across at least one boundary
between the partitions.

For example, when the operation is an addition operation (e.g., a two's complement addition), each of
the plurality of partitions performs an addition operation. When the addition is to be performed on full word
length operands, carries are allowed to propagate between the partitions. When performing the addition
operation in parallel on a plurality of less than one full word length operand sets, a carry is prevented from
propagating across at least one boundary between the partitions.

Likewise, when the operation is a shift, each of the plurality of partitions performs a shift operation.
When the shift is to be performed on full word length operands, shifts are allowed between the partitions.
When performing the operation in parallel using a plurality of less than one full word length operands, a shift
is prevented from crossing at least one boundary between the partitions.

Also in accordance with a preferred embodiment of the present invention, a multiplier implements both
multiplication of whole word multiplicands and parallel multiplication of sub-word multiplicands. Circuitry, for
example an array of logic AND gates (or their equivalent), generates partial products. Partial product sum
circuitry sums the partial products to produce a result. Partial product gating means, in response to the
selection of parallel multiplication of sub-word multiplicands, forces selected partial products to have a value
of 0, thereby implementing parallel multiplication of sub-word multiplicands. When the multiplier is
implementing whole word multiplication, none of the partial products are forced to have a value of 0. The
partial product gating means may be implemented, for example, using third inputs to at least a portion of
the logic AND gates.
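
The gating of partial products may be sketched with a small software model of an AND-gate array. The
sketch is illustrative only: the function name `gated_multiply`, the 16-bit operand width and, in particular,
the choice to zero those partial products that pair a bit from one half-word with a bit from the other are
assumptions made for the example.

```python
def gated_multiply(a, b, width=16, parallel=False):
    """Sum AND-gate partial products; in parallel mode force the cross terms to 0."""
    half = width // 2
    result = 0
    for row in range(width):          # bit index of multiplier b
        for col in range(width):      # bit index of multiplicand a
            same_half = (row < half) == (col < half)
            gate = 1 if (not parallel or same_half) else 0  # third AND input
            partial = ((a >> col) & 1) & ((b >> row) & 1) & gate
            result += partial << (row + col)
    return result
```

With `parallel=False` the function returns the ordinary product. With `parallel=True` the low `width`
bits of the result hold the product of the two low half-words and the high `width` bits hold the product of
the two high half-words, so `gated_multiply(0x0305, 0x0204, parallel=True)` returns `0x00060014`.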

The present invention allows for a single processor system to significantly increase performance by
facilitating parallel processing operations when operands are less than the full word length. This inexpensive
use of parallelism results in a huge increase in performance for computations that can utilize this type of
data parallelism without significant additional cost in silicon space on a processor chip or complexity in
design. The present invention also allows for parallel processing operations performed by a processor in
response to a single instruction.

Brief Description of the Drawings

Figure 1 shows a simplified block diagram of an operation execution data path within a processor in 
accordance with preferred embodiments of the present invention. 
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A system allows parallel data processing within a single processor. In order to allow for parallel processing
of data, an arithmetic logic unit (26) or other operation executing entity within the processing system such as a
shifter (29) is partitioned. Within each partition (41,51), operations are performed on a portion of one or more
operands. When the operation is to be performed on full word length operands, there is no parallel processing.
Thus data is allowed to freely propagate across boundaries between the partitions (41,51). When performing the
operation in parallel using a plurality of operands of less than one full word in length, data is prevented from
being propagated across at least one boundary between the partitions (41,51). The principles of the present
invention may also be used to implement a multiplier (301-316,320) which performs parallel multiplication of
partial word multiplicands.
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